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Abstract: We propose Bayesian model averaging (BMA) as a method for postpro¬ 
cessing the results of model-based clustering. Given a number of competing models, appro¬ 
priate model summaries are averaged, using the posterior model probabilities, instead of 
being taken from a single “best” model. We demonstrate the use of BMA in model-based 
clustering for a number of datasets. We show that BMA provides a useful summary of the 
clustering of observations while taking model uncertainty into account. Further, we show 
that BMA in conjunction with model-based clustering gives a competitive method for 
density estimation in a multivariate setting. Applying BMA in the model-based context 
is fast and can give enhanced modeling performance. 
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dimensional data • Model-based clustering • Model uncertainty 


1 Introduction 

Model-based clustering methods are based on the assumption that the population 


can be modeled using the fini te mixture model (e.g. Banfield and Raftery, 1993 


Celeux and Govacrt] 1995; Fraley and Raftery, |2002| ). Within this paradigm, it is 
assumed that the data come from G subpopulations, which correspond to the mix¬ 
ture components, and within each subpopulation the data are modeled using a single 
parametric component distribution. The most common finite mixture model that is 
used for model-based clustering is the finite normal mixture, but many alternatives 


exist (e.g. McLachlan and Peel, 2000; Lee and McLachlan, 2013b). 

Model selection is an intrinsic part of model-based clustering. In particular, the 
number of clusters (component densities), G, is unknown and a number of competing 
choices for component densities may also be under consideration. Each combination 
of component density and number of clusters can be viewed as a separate model, and 
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a model selection approach can be used to select both at the same time (e.g. Fraley 


and Raftery, 2002). In most implementations of model-based clustering, the “best” 


model is chosen by using some criterion and clustering is based on the “best” single 
model. A number of methods have been proposed for selecting the “best” model 
including choosing the model with the highest Bayesian Information Criterion (BIC) 


(Schwarz, 1978) or the highest Integrated Complete Likelihood (ICL) (Biernacki 


et ah, 2000). 


The approach of reporting the results of model-based clustering based on a single 
model ignores the uncertainty that arises from the model selection. Consequently, 
the uncertainty about quantities of interest may be underestimated. We propose 
basing the results of model-based clustering on a combination of the results from 
all candidate models rather than on those from a single model. We propose taking 
a weighted average of the model summaries, where the weights are approximate 
posterior model probabilities. Thus we propose using Bayesian Model Averaging 
(BMA) (Hoeting et ah, 1999) within the model-based clustering paradigm. 

To obtain valid inference using BMA, the quantity of interest should have the 
same meaning in each model under consideration. In model-based clustering, the 
quantities of interest must have the same meaning for all values of G and must be 
invariant to the labeling of the clusters in the finite mixture model. This is because 
finite mixture models are identifiable only up to permutations of the cluster labels. 
Here we focus on inference about the clustering consensus matrix. This has the 
same meaning for all values of G and is invariant to the cluster labeling. 

We also consider using model-based clustering as a method for multivariate den¬ 
sity estimation, following Fraley and Raftery (2002). In this case, the estimated 
density has the same meaning in all models, so we again use the posterior model 
probabilities as weights to offer an alternative density estimation procedure to multi¬ 
variate kernel density estimation (e.g., Scott, 1992; Duong, 2007) or to model-based 


clustering density estimation methods based on a single model (Fraley and Raftery 


2002 ). 


The paper is organized as follows. In Section [2| we briefly review the model- 
based clustering paradigm. In Section [3j we outline how the BMA approach can 
be used to deal with model uncertainty. In Section [4j we describe how we can 
create a matrix for each model that is invariant to the number and labeling of 
the clusters. In Section [5j we provide an assessment of clustering performance for 
BMA in conjunction with model-based clustering. In Section [6| we describe the 
background for density estimation and introduce a framework for combining BMA 
and model-based clustering for multivariate density estimation. We illustrate this 
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with a number of simulations. We conclude, in Section[7j with a discussion of related 
work and suggest future directions. 


2 Model-based clustering 


Model-based clustering ( 

Banfield and Raftery, 

1993; 

Celeux and Govaert, 

1995; 

Fra- 

ley and Raftery, 

2002, 

2007 

) is used for clustering data into groups, where the num- 


ber of groups G is typically unknown. Here we focus on model-based clustering 


based on the finite normal mixture model as described in Fraley and Raftery (2002). 


We assume that there are G clusters, where each cluster g arises with probability 
r g . Data within each cluster follow a normal distribution with cluster-specific mean 
H g and covariance E g , so that the data are characterized by a finite mixture of 
normal distributions. The density of each observation x t is 


f( x i ) = ^Tg<t>{Xi\ fig,Eg), 

9=1 

where </>(• |-, •) is a multivariate normal density. In practice, the number of clusters, 
G , is usually unknown and needs to be determined as part of the model inference. 
The assumption of multivariate normal distributed clusters implies that the clus¬ 


ters are elliptical in shape. Banfield and Raftery (1993) proposed that constraints 


be placed on the covariance matrices to gain parsimony. These are specified using 
a modified eigenvalue decomposition of E g , namely 

'Eg = XgDgAgD], 


where X g is a constant which controls the cluster volume, D s is an orthogonal matrix 
of eigenvectors which control the orientation of the clusters, and A g a diagonal 
matrix, with entries proportional to the eigenvalues, which control the shape of the 
cluster. 

We can restrict each property of the covariance E g (volume, shape, orientation) 


in different ways, resulting in fourteen different possible models (Biernacki et al. 


2006). Throughout this paper, we will consider the ten covariance structures imple¬ 


mented in the mclust software (Fraley et al., 2012), as displayed in Figure [I] Each 
letter in the name of a model corresponds to the constraint placed on the volume, 
shape and orientation respectively. The constraint can be equal (E), variable (V) or 
identity (I). For example, in the EEV model, each cluster has the same volume and 
the same shape but the orientations of the clusters can differ. Other parametriza- 
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Figure 1 Examples of cluster shapes under each covariance restriction. 


tions of covariance matrices are useful in the context of model-based cluster analysis 
(e.g. McNicholas and Murphy, 2008, 20101 Biernacki and Lourme, 2014), but we do 


not consider them further here. 

Based on choosing a covariance constraint and the number of groups G, we can 


fit the finite mixture of normal distributions using the EM algorithm (Dempster 


et al., 1977) to yield the maximum likelihood estimates of the model parameters 


and an estimate of the posterior probability that each observation x % belongs to 
group g. The resulting N x G matrix of probabilities, denoted by Z, has typical 
entry z tg , which is the estimated conditional posterior probability that observation 
i belongs to group g. This is 


P{Observation i comes from group g} 


z ig ~ 


fg0(Xi|Ag,S g ) 

I]n'=l 79 ,( H X *I/V’ 


( 1 ) 


where {(r 3 , ji y , S g ) : g = 1,2 ..., G} are the maximum likelihood estimates of the 
model parameters. The probability matrix, Z, can be converted into group labels 
for the observations and thus the observations can be clustered. 

In a clustering problem where several possible values for G are considered (e.g. G = 
1, 2,..., 9) and when all ten model types are also considered, we have to choose be¬ 
tween about 83 different models. However, we would expect only a small number 
of these models to be strongly supported by the data. A frequently-used model 
selection approach is to choose the “best” model using the Bayesian Information 


Criterion (BIC), (Schwarz, 1978) where the BIC for model A4 m is given by 


BIC m = 21og(£) - K m log(AT). 


( 2 ) 


In Equation [2j C is the maximized likelihood of the data, n m is the number of 
estimated model parameters for model Ai m , and N is the number of observations. 


This approach was proposed for clustering by Dasgupta and Raftery (1998), and 
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has been found to perform well (e.g. 

Once the model has been chosen, the cluster membership matrix is based on the 
parameters for that model alone. The other competing models are then discarded 
and model uncertainty is ignored. 


Steele and Raftery, 2010 


3 Model Uncertainty and Bayesian model aver¬ 
aging 


Basing inferences on a single “best” model ignores uncertainty about what the best 
model is. This can result in underestimating uncertainty about quantities of interest 
such as cluster membership or the estimated density. 

We address this using Bayesian model averaging (BMA) 

, which proceeds as follows. If {A4 1 ,.... At m} 
denotes the set of all models being considered and if A is the quantity of interest, 
then the posterior distribution of A given the data is 

M 

P(A|Data) = P(A|A4 m , Data)P(A4 m |Data). (3) 

m= 1 


and Raftery, 1994; Hoeting et al., 1999 


( Leamer[|1978[ [Madigan 


This is an average of the posterior distributions under each model weighted by the 
corresponding posterior model probabilities. 

In Equation [3| the posterior probability of model Ai m is given by 


P(_A4 m |Data) 


P(Data|Ad m )P(A4 m ) 
P(Data|A4 m ')P(A4 m ') ’ 


where 


P(Data|A4 m ) = / F(Da,tei\9 m , M m )F(9 m \M m )d9 ri 


( 4 ) 


is the marginal likelihood of model Ai m , 9 m is the vector of parameters of model 
A4 m , P(Data|d m , A4 m ) is the likelihood for model A4 m , P(#fc|A4fc) is the prior density 
of the parameter 9 m in model A4 m and P(A4 m ) is the prior probability of model A i m . 
All probabilities are conditional on the set of all models being considered. Madigan 
and Raftery (1994) argued that averaging over all of the models, as in Equation [3j 
provides better predictive ability than using any single model. 

One difficulty is that the integral in Equation |4] is intractable in most cases 
and so an approximation to the posterior model probability is required. We use 
the Bayesian Information Criterion (BIC) to derive approximate posterior model 


probabilities for Bayesian model averaging for model-based clustering (Dasgupta 
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and Raftery, 1998). If all the models are equally likely a priori , so that P(A4i) = 
• = P (M-m) = 1/M, this yields 


P(Al m |Data) 


exp ( \BICrn)_ 

E !'=1 ex P 


( 5 ) 


4 Bayesian model averaging for clustering 

An important point when implementing BMA is that the quantity of interest, A, 
must have a consistent definition across all models. Care should be taken when 
implementing BMA in the clustering setting because the labeling of clusters (com¬ 
ponents) within a model is arbitrary, so the labeling of clusters under one model may 
not correspond to the labeling under another model. Also, the number of clusters 
can vary from model to model. 


4.1 Similarity matrix 

Strehl and Ghosh (2003) introduced the concept of a binary similarity matrix in the 
context of clustering. This matrix is N x N and for any pair of observations 
the (i,j)th entry in the matrix is 1 if the ith and jth observations are in the same 
cluster in a given model, and 0 if they are not. They used this structure to form 


what they call pairwise duster ensembles ; Monti et al. (2003) and Kuncheva and 


Hadjitodorov (2004) also exploited this idea in a clustering context. In these articles, 


weights were given to various clustering methods and the similarity matrices were 
combined using these weights. The resulting matrix was described by|Kuncheva and 


Hadjitodorov 

(2004 

as a consensus matrix. 


Fern and Brodley 

(2003 

) extended the binary similarity matrix to soft-clustering 


techniques, which return a probability vector P(g|z, M. m ),g = 1,..., G for an obser¬ 
vation i, representing the probability that i belongs to each cluster under model Ai m . 
The values P(<?|i, AT,,,) are analogous to the z ig values, as defined in Equation]!} Fern 
and Brodley] (2003) used model-based clustering to cluster high-dimensional data by 
randomly projecting the data into a low-dimensional space and clustering the data in 
the low-dimensional space. The data are projected multiple times and the consensus 
matrices are averaged across all projections. 


Fern and Brodley (2003) defined S™ as the probability that observations i and 


j belong to the same cluster under model A4 m . This can be calculated as 


G G 

S’” = £ P(g\i, M m ) x P (g\j, M m ) ~ ]T 4 

9 = 1 9 =1 


m m 

ig Z j9 ’ 
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because z^ is an estimate of the probability that observation i belongs to cluster g 
in model A4 m . 

We can construct the similarity matrix S m for each model Ai m as follows: 

s ,„ = f (Z”(Z m n, • when i ft j 
lJ | 1 , when i = j. 

We propose using the matrix S m of probabilities that any pair of observations belong 
to the same cluster, when implementing BMA for model-based clustering. This 
ensures that we are averaging a quantity that has the same meaning across differing 
number of clusters and is invariant to cluster labeling. 

4.2 Properties of the similarity matrix 

The matrix S m for any model A4 m will be IV x iV where N is the number of ob¬ 
servations. It is invariant to label switching, and its dimension is invariant to the 
number of clusters in the model. It can therefore be used to combine models with 
different numbers of clusters. It can be viewed as a similarity matrix between the 
data points (aq,..., xn). 

The element s- of the matrix S m represents the probability that i and j belong 
to the same cluster. We can also define a quantity dij = 1 — s l3 as the probability 
that they are not in the same cluster. So, S = 1 — D, where D can be thought of 
as a dissimilarity matrix. Therefore, D can be used with any clustering algorithm 
which operates directly on a dissimilarity matrix. 

We define .sq as the minimum probability that two observations i and j in a 
set A belong together, which we can think of as the minimum probability that two 
elements in A belong to the same cluster. If we define d^ as 1 — .sq, we obtain the 
following result: 


dj[ 1 

= 1 — min P {i,j belong together} 

i,jeA 

= max[l — P {i,j belong together}] 

i,jeA 

= ma x(dij). 


Thus the maximum probability that two elements of A do not appear in the 
same cluster is dy\. It follows that D can be used in hierarchical clustering with 


complete linkage (Sokal and Sneath, 1963) and the results will have an intuitive 
interpretation. In complete linkage, the dissimilarity between two groups G and H 
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is the largest dissimilarity between opposite groups, that is, the maximum distance 
between an element of G and an element of H, namely 

dcomp\ete(,G , H^) IB.&X (Ijj . 

iG:G ,j GlH 

Hierarchical complete-linkage clustering will then merge the groups with the smallest 
^complete- This continues until all the groups are merged. 

The results can be shown on a dendrogram which can be cut at any level. There 
is an intuitive interpretation of the level of cut. If we cut the dendrogram at a 
particular probability level, any observations clustered together at that level have a 
probability of at least that value of all being in the same cluster. 

We illustrate this with a toy example. Suppose we have six data points A ,..., F 
and two equally likely clustering results. The first clustering attempt puts A, B. C 
in one cluster and D , E, F in the other, while the second attempt puts A, C, E in one 
cluster and B , D , F in the other. For simplicity we will use hard clustering where 
the probability of cluster membership is either 0 or 1 for each observation. If both 
models have equal probability, the S matrix will be 


G.O 

0.5 

1.0 

0.0 

0.5 

o.o\ 

0.5 

1.0 

0.5 

0.5 

0.0 

0.5 

1.0 

0.5 

1.0 

0.0 

0.5 

0.0 

0.0 

0.5 

0.0 

1.0 

0.5 

1.0 

0.5 

0.0 

0.5 

0.5 

1.0 

0.5 

^0.0 

0.5 

0.0 

1.0 

0.5 

1°7 


If we proceed to implement hierarchical clustering on the corresponding dissimilarity 
matrix, with complete linkage, we get the dendrogram shown in Figure [2} 

Averaging the two models and cutting the resulting dendogram at any probability 
level above 0.5 leads to a four-cluster solution. This makes sense. Both clusterings 
agree that A and C belong in the same cluster as do D and F. However, they 
disagree on B and E. This dendrogram gives a probabilistic representation of this 
situation. If we cut the dendogram in Figure [2] at 0.2, we see that A, B and C have 
at least a probability 0.2 of being in the same group, but cutting at 0.8 shows that 
A and C have at least a probability of 0.8 of being in the same group. 

4.3 Summary of method 

We can now carry out Bayesian model averaging by using the posterior model prob¬ 
abilities estimated from Equation [5] as weights and the similarity matrices for each 
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Figure 2 Toy data: Dendrogram denotes the probability of certain observations 
being in the same group. For example if we cut the dendrogram at the blue line, 
any observations clustered together have a probability of at least 0.8 of all being in 
the same group. If we cut at the red line, any observations clustered together at 
that level have a probability of at least 0.2 of all being in the same group. 


candidate model, defined in Equation [6} For each pair of observations i and j in our 
dataset, we propose assigning the probability vector 

P {Observation i , j in same cluster | Data} 

M 

= ^^P{Obs i,j in same cluster \A4 m }F {A4 m \ Data} 

771 = 1 

Em=i Eg=i exp (\BIC m ) 

Em=l exp (\BICm) 


5 Results 


The proposed methodology is demonstrated using two well known data sets: Fisher’s 


iris data (Fisher, 1936) and Forma’s wine data (Forina et ah, 1986). 


We clustered the datasets using the mclust software (Version 4.4) (Fraley et al. 


2012) and R (R Core Team, 2014). Where appropriate we have ordered the obser¬ 


vations using the gclus R package (Version 1.3.1) (Hurley, 2012) and the seriation R 
package (Version 1.0-14) (Hahsler et al., 2014). 

When using the default settings in mclust, a total of 83 candidate models were 
fitted. These include three one-cluster models (G — 1) and the ten possible covari¬ 
ance structures (Figure [l]) for each number of clusters G — 2, 3,..., 9. One can of 
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course fit a larger or smaller set of models if desired. 


5.1 Clustering Fisher’s iris data 


The iris data (Fisher, 1936j) gives the measurements in centimetres of the variables 
sepal length and width and petal length and width for 150 observations with 50 
of each of three species of iris: versicolor, setosa and virginica. Table [l] shows the 
model-based clustering models with the highest BIC values for the iris data. These 
results show that the VEV model with two clusters has the highest BIC, but that 
there is considerable uncertainty about whether this is the best model. In a single 
model scenario, the VEV model with two clusters would be selected and clustering 
would be based solely on this model. 


Table 1 Iris data: BIC and posterior model probabilities for the three most favored 
models, i.e., those with the highest BIC. The boldfaced model is chosen by this 
criterion. 


Model 

Type 

No of 
Clusters 

BIC 

Posterior 

Model Probability 

VEV 

2 

-561.73 

0.601 

VEV 

3 

-562.55 

0.398 

VVV 

2 

-574.028 

0.001 

Others 



< 0.00001 


However, it can be seen from Table [lj that the BIC value for the three-cluster 
VEV solution is similar to that for the selected model. The posterior model prob¬ 
abilities in the table are estimated using Equation [5] and we see that the posterior 
probability for this second best model is almost 40%. 

In order to visualize the observations that are likely to be in the same cluster, we 
use a heatmap to represent the similarity matrix. The ordering of the observations 
in the heatmap is important. Here we present the data in species order, which yields 
an intuitive heatmap, as shown in Figure [3| However, in many applications this will 
not be the case, but the structure may be easier to see if the data are ordered using 
some seriation method. 


Figure 3(a) shows the heatmap for the similarity matrix for the two-cluster VEV 
model, while Figure |3(b)| shows the three-cluster model; these are the two models 
with the highest BIC values. The red sections of the figure represent clusters of (i,j) 
pairs that have a high probability of being in the same cluster, which the white areas 
show ( i,j ) pairs that are unlikely to be in the same cluster. The clusters are isolated 
along the diagonal of the heatmap. 
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(a) (b) (c) 

Figure 3 Iris data: (a) Models with the highest BIC (VEV with two clusters) and 
(b) with the second highest BIC (VEV with three clusters) represented by heat maps 
of the similarity matrices. The data are in species order. The redder the color the 
more likely that pairs of observations appear in the same cluster, (c) represents the 
combination of models using BMA. 


The model that would be selected by BIC separates the setosa species very well 
from the other two species but it merges the two other species into one cluster. It can 
be seen also that the probabilities assigned to the co-clustering pairs are very high. 
The fact that there is large uncertainty associated with the two-cluster solution is 
not clear in the single model results. 

The second most likely model shows separation between the three clusters with 
high probability, as we see in Figure |3(b) . Thus the argument can be made that 
this model should contribute to the final clustering result. 

Figure 3(c) shows the heatmap for the similarity matrix produced from the 
Bayesian model averaging process. The method separates the large cluster into 
two clusters which approximate the known species groups quite well, but also reflect 
the uncertainty about whether there are really three species. 


We can also present these results using dendrograms, as detailed in Section 4.2 


and these are shown in Figure [4| The dendrograms show that the single model 
results cluster the observations into two clear groups, whereas the BMA results 
show the possibility of both a two-cluster and/or three-cluster solution, depending 
on the cutoff used. 


5.2 Clustering Italian wines data 

We now apply the methodology to a well-known dataset consisting of 27 chemical 
measurements on each of 178 wine samples belonging to three cultivars of wine 


|eTah|[l986l ). Table @ shows the number of samples collected in each vintage year 
for the three cultivars. 


(Barolo, Grignolino and Barbera) 
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Setosa 

Versicoli 

Virginia 



(a) (b) 

Figure 4 Iris data: Dendrograms of dissimilarity matrices for the iris data for the 
model with (a) the highest BIC and (b) for the BMA solution. Cutting dendrogram 
(b) at a level of 0.5 (red line) and 0.75 (blue line) give a probability of at least 
0.5 that the two clusters belong together, and a probability of at least 0.75 that 
the observations in the three branches of the tree belong in their three clusters 
respectively. 


Table 2 Wines data: A list of the number of samples of each cultivar for each 
vintage year investigated in the study. 


Cat.index 

Cat.name 

Samples per year 

’70 71 72 73 74 75 76 77 78 79 

Total 

1 

Barolo 

19 20 20 

59 

2 

Grignolino 

9 9 7 9 16 9 12 

71 

3 

Barbera 

9 5 29 5 

48 


Table [3] shows the candidate models with the highest BIC and hence the highest 
approximate posterior probability; all other models had negligible posterior proba¬ 
bility. Although the data consist of samples from three different cultivars, a seven- 
cluster model was preferred to any of the three-cluster models. The seven clusters 
successfully separate the three cultivars, and also partly reflect the different vintage 
years shown in Table [2] (McNicholas and Murphy, 2008). 

The heatmap of the similarity matrix of the optimal model is presented in Fig¬ 
ure |5(a)| Here the observations are shown in order of vintage year within cultivar. 
It appears that the clustering results partly reflect the vintage years as well as the 
cultivars. 

We seriated the data to reorder the observations, using the order of the leaf 
nodes in a dendrogram produced by hierarchical clustering with complete linkage. 
In hierarchical displays, a decision is needed at each merge to specify which subtree 
should go to the left and which to the right (Hurley, 2012). We used the order 
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Table 3 Wines data: BIC and posterior model probabilities for the three models 
with the highest BIC. Again, the boldfaced model is chosen when carrying out 
model-based clustering. 


Model 

Type 

No of 
Clusters 

BIC 

Posterior 

Model Probability 

VEI 

7 

-23951.91 

0.600 

EVI 

3 

-23953.87 

0.225 

VVI 

3 

-23954.37 

0.175 

others 



< 0.001 



(a) (b) 

Figure 5 Wines data: (a) Highest BIC model (VEI with seven clusters) represented 
by heat map of the similarity matrix, in vintage year within cultivar order and (b) 
in seriated order. 


suggested in Gruvaeus and Wainer (1972), which ensures that objects at the bound¬ 
aries of each class were located next to objects outside the class which they most 
resembled (Gordon, 1987). At a merge of clusters A and B, the new cluster is one 
of ( A, B ), (A',B), ( A, B '), (A',B'), where A' denotes A in reverse order. The new 
cluster is chosen to minimize the distance between the object in A placed adjacent 
to an object from B. The reordered similarity matrix, shown in Figure |5(b)[ has 
seven clear clusters, with uncertainty about the group membership of some of the 
observations. 

From Table [3] we see that the combined posterior probability of the two three- 
cluster models (EVI and VVI) is high, at just under 40%. We use BMA to average 
the similarity matrices across the candidate models and show the results in Figure [6| 
Figure |6(a)| shows the clustering in the originally presented order, after BMA has 
been carried out. It gives a clearer picture of the cultivar clustering, although there 
is some uncertainty, shown by the yellow colors. Figure 6(b)| shows the seriated 
version, where the smaller clusters group naturally into three larger clusters, but 
with one small cluster that could belong to either of two cultivars. 
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(a) (b) 

Figure 6 Wines data: Heat maps after Bayesian model averaging denoting the 
combined similarity matrix (consensus matrix) (a) in vintage year within cultivar 
order and (b) in seriated order. 
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(a) (b) 

Figure 7 Wines data: Dendrograms of dissimilarity matrices for (a) the model 
with the highest BIC (VEI with seven clusters), and (b) the BMA solution. The 
horizontal dashed lines indicate the partitions that arise by cutting the dendogram 
at various levels. The Grignolino cultivar is the red one on the right. 


Figure [7] shows the dendrogram of the complete-linkage clustering of the consen¬ 
sus matrices according to (a) the best model and (b) BMA. The dendogram for the 
best model (VEI with seven clusters) reproduces the model’s seven clusters quite 
clearly if the dendogram is cut at any level above about 0.5, as expected. The BMA 
dendogram reflects more uncertainty, as one would expect. It divides the data into 
four groups if cut at 0.1. 

The two three-cluster solutions give similar clustering results. In the seven- 
cluster solution, three of the clusters are for the Grignolino observations, and two 
each for the Barbera and Barolo observations. The Barolo and Barbera clusters 
break down by vintage year, whereas for the Grignolino observations the clusters 
correspond less clearly to vintage years. 
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6 Bayesian model averaging for density estima¬ 
tion. 

The fitted mixture model produced by model-based clustering provides an overall 


density estimate for the data generation process. Ferguson (1983) showed that finite 


mixtures of normal distributions can approximate any distribution on the real line 
to within any given accuracy. Thus the density estimate produced by model-based 
clustering can be used as a density estimation method. The performance of densities 


fitted in this way was assessed by Roeder and Wasserman (1997) for the univariate 


case and by Fraley and Raftery (2002) for the bivariate case. Both of these studies 


showed model-based clustering to be competitive with state of the art kernel density 
estimation methods. 

We propose using Bayesian model averaging of the density estimates for each 
model using the posterior model probabilities from Equation [5} So, given model- 
based density estimates fM m ( x )i for m = 1 , 2 ,..., M, we have 

M 


/bmaO) = Data }f Mm { x )- 


m =1 


We compare the BMA results with kernel density estimation methods. Kernel 
density estimation has long been used for density estimation and visualization of 


univariate data (Rosenblatt, 1956; Parzen, 1962; Jones et al., 1996). The kernel 
density estimate fh of a univariate density / based on a random sample Ah...., W\- 
of size N is 


fh(x) = 


where Kh(-) = ( l/h)K(-/h ) for a kernel function K, assumed to be a symmetric 
probability density and a bandwidth h (the smoothing parameter). We also compare 
/bma with the single-model estimate /sm, where 

fsu(x) = ImJ x ) 

where j\4 m is the model with the highest value of BIC. (SM stands for “single 
model”.) 


N 


2=1 

N / v 
X^- K ‘ x ~ Xi 

h 


1 

N 


i=l 


h 
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Wand and Jones (1994) described the extension of the method to the bivariate 


case. The multivariate kernel density estimate is 


1 N 

/>;H) = - ]Tw h (x-W), 

2=1 


where H is now a d x d positive definite matrix and K is a rbvariate spherically 
symmetric density function; a common simplification is to use a diagonal H (e.g., 


Wand and Jones, 1994, Chapter 4.2). 

The performance of density estimation procedures can be compared using mean 
integrated squared error (MISE) or expected Kullback-Leibler divergence (KL), de¬ 
fined as follows: 


MISE = E f ~ / [f(x) - f(x)] 2 dx 


MKL = W / log 


f( x ) 

K x ) 


f(x)dx, 


( 7 ) 

( 8 ) 


where the expected value is taken with respect to the resulting density estimate, /, 
computed from a random sample, Xi,X 2 ,..., X n , drawn from /. For both criteria, 
smaller is better. MISE (Equation [ 7 ]) is the most commonly used measure of per¬ 
formance, while Kullback-Leibler divergence (Equation [8]) provides an alternative 
which considers the differences in densities on a logarithmic scale; this places more 
emphasis on differences in regions of low density. 

The asymptotic performance of kernel density estimation under the MISE crite¬ 


rion was described by Silverman (1986) and Scott (1992) and the form of the optimal 


bandwidth for the kernel was derived. Hall (1987) studied the asymptotic perfor 


mance of kernel density estimation under Kullback-Leibler divergence and showed 
that the optimal bandwidth in this case leads to a smoother density estimate than 
under MISE. The optimal bandwidths differ because Kullback-Leibler puts a larger 
penalty on regions where a density estimate has low density but the true density has 


high density. Wand and Jones (1993) and Duong and Hazelton (2003) investigated 


the use of unconstrained parametrisations of the H matrix as opposed to diagonal 
ones on simulated target densities, and concluded that this can improve efficiency. 

We use the extended version of multivariate kernel density estimation as de¬ 
scribed in |Duong and Hazelton (2003) and Duong (2007) as a benchmark for com¬ 
parison with the model-based clustering density estimates. This approach uses a 
two-stage estimation procedure, where a pilot density estimate is used to get an 
improved estimate of the optimal bandwidth compared to the standard plug-in 
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estimates (e.g. Silverman, 1986, Section 3.4.2). Duong (2007) used kernels with 
non-diagonal H matrices. 


6.1 Density estimation results 

6.1.1 Simulation study for bivariate density estimation 


Fraley and Raftery 

(2002 

defined bivariate analogs of the first ten of the univariate 

datasets defined in 

Marron and Wand 

(1992 

to compare the performance of the 


fitted model-based clustering density with multivariate kernel density estimation. 
Contour plots of the densities are shown in Appendix [A] as well as the parameter 
settings for the actual densities we used. 


We produced 250 simulations of 250 observations from each of the same ten 
distributions and compared density estimation using model-based clustering with the 
Bayesian model averaged result in terms of the mean integrated squared error and 
the Kullback-Leibler distance. We compared the performance of the model-based 
clustering approaches to the extended kernel density estimator! method described 


Duong and Hazelton 

to 

o 

o 

GO 

as implemented in the ks R package ( 

Duong, 

2007 


2014). 


The results of the simulation study are shown in Table [4] where the numbers 
in the MISE columns are the MISE for kernel density estimation and for the single 
model respectively divided by the MISE for the BMA method. Similarly, for the 
Kullback-Leibler column, we divide the Kullback-Leibler distance for kernel density 
estimation and for the single model by the Kullback-Leibler distance for the BMA 
method. 
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Table 4 Mean MISE and Kullback-Leibler (KL) distance ratios for density estima¬ 
tion via model-based clustering with Bayesian model averaging (BMA) as against 
kernel density estimation with the ks R package (KS) and single-model model-based 
clustering (SM). Datasets used are the ten bivariate extensions of Marron-Wand 
univariate densities from Fraley and Raftery (2002). The numbers in the MISE 
columns are the MISE for kernel density estimation and for the single model respec¬ 
tively divided by the MISE for the BMA method; similarly for the KL columns. A 
value greater than 1.00 indicates that BMA is preferred to the competing method 
while a value less than 1.00 denotes that the competing method is preferred to BMA. 
(Results are based on 250 simulated datasets with 250 observations each for each 
density type.) 


Model 

KS/BMA 
MISE KL 

SM/BMA 
MISE KL 

Single Gaussian 

6.19 

4.86 

1.00 

1.00 

Skewed unimodal 

1.06 

1.53 

1.03 

1.03 

Strongly skewed 

2.75 

11.4 

1.04 

1.05 

Kurtotic unimodal 

0.45 

11.7 

1.00 

1.00 

Outlier 

1.81 

1.68 

1.00 

1.00 

Bimodal 

1.12 

1.46 

1.08 

1.08 

Separated bimodal 

3.10 

3.74 

1.01 

1.01 

Asymmetric bimodal 

0.55 

2.63 

1.00 

1.00 

Trimodal 

0.98 

0.99 

1.01 

1.03 

Claw (Bart Simpson) 

0.91 

1.37 

1.09 

1.19 


Density estimation using model-based clustering with BMA compares very favor¬ 
ably with kernel density estimation according to the KL criterion, while it compares 
favorably in the majority of cases by the MISE criterion. It also compares favorably 
with density estimation using model-based clustering with a single model in all cases 
and by both criteria, although the gains are more modest. 


Fraley and Raftery (2002) gave further comparisons of single model density es¬ 


timation with other kernel density estimation methods, namely Gaussian kernel 
density estimation using both the normal optimal bandwidth and cross-validated 


bandwidth (Bowman and Azzalini, 1997). 


6.1.2 Simulation study for higher dimensional density estimation 

We also investigated density estimation for higher dimensional data. We first added 
dimensions consisting of observations from the standard normal distribution to the 
existing distributions used in Section |6.1.1[ The first three rows in Table [5] refer to 
bivariate distributions with one extra such dimension added, while the next three 
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rows refer to the same distributions with four extra standard normal dimensions. We 
also implemented three- and six-dimensional versions of the bimodal distribution. 
We allowed for variable separation of the modes as in the bimodal case, and the 
results are in rows seven to twelve of Table [5j The settings used for all these 
simulations are in Appendix |A.2| 


Table 5 MISE and Kullback-Leibler (KL) distance ratios for density estimation 
via model-based clustering with Bayesian model averaging (BMA) as against kernel 
density estimation with the ks package (KS) and single model model-based clustering 
(SM). The top six lines comprise simulations of bivariate data as before with either 
one or four extra dimensions of standard normal. The bottom six lines comprise 
extensions to three and six dimensions of the bimodal density of F raley and R aftery 
(2002) with variable separation of the modes. 


Model 

KS/BMA 

SM/BMA 


MISE 

KL 

MISE 

KL 

Strongly skewed 3D 

2.46 

7.47 

1.03 

1.04 

Separated bimodal 3D 

5.08 

6.39 

1.00 

1.02 

Asymmetric bimodal 3D 

0.75 

3.53 

1.01 

1.00 

Strongly skewed 6D 

2.66 

4.95 

1.03 

1.04 

Separated bimodal 6D 

8.18 

11.0 

1.03 

1.03 

Asymmetric bimodal 6D 

1.41 

4.31 

1.06 

1.04 

Bimodal 3D (sep of 1.5) 

6.51 

6.45 

1.00 

1.03 

Bimodal 3D (sep of 3) 

1.94 

3.13 

1.03 

1.05 

Bimodal 3D (sep of 5) 

5.34 

6.30 

1.00 

1.00 

Bimodal 6D (sep of 1.5) 

12.3 

10.56 

1.06 

1.04 

Bimodal 6D (sep of 3) 

10.8 

13.61 

1.03 

1.04 

Bimodal 6D (sep of 5) 

8.41 

11.9 

1.11 

1.08 


Model-based clustering with BMA strongly outperformed kernel density estima¬ 
tion in all cases except one for both three-dimensional and six-dimensional data, 
according to both criteria, MISE and KL. Model-based clustering with BMA also 
uniformly outperformed single-model model-based clustering, although the gain was 
more modest. 

Density estimation using model-based clustering can be carried out for higher 
dimensions, and performs well. However, we limited ourselves to six dimensions 
because the ks R package provides results for at most six dimensions. We conjecture 
that model-based clustering’s gain in performance would be even greater for higher 
dimensions. 
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6.1.3 Lansing maples 


We now consider the Lansing trees data set (|Gerrard 1969) where we are interested 
in the density of the maple trees in a forest. The models with the highest BIC are 
shown in Table [6] and we can see that three models have non-negligible posterior 
probability. 


Table 6 Lansing Woods data: BIC and posterior model probabilities for the three 
models with the highest BIC. 


Model 

Type 

No of 

Clusters 

BIC 

Posterior 

Model Probability 

VII 

7 

154.339 

0.49462 

VEI 

7 

153.569 

0.33655 

VII 

6 

152.081 

0.15998 

others 



< 0.01 


A six-cluster solution has approximately 16% posterior model probability with 
two seven-cluster solutions having approximately 83% between them. There are 
very small probabilities associated with five and eight cluster solutions respectively. 
However, the best model that would be chosen according to BIC is VII with seven 
clusters. 

We can compare graphically the contour plots for the models with the highest 
BIC and the plot for the density estmation after BIC (Figure [8]). Figures |8(a)| |8(b) 


and 8(c) show the three most likely models according to BIC, while Figure [8(d)] shows 
the BMA density estimate. We can see that the high density region at approximately 
(0.4, 0.1) in Figures [8(a)| and |8(a)| does not appear in Figure |8(c)j Further, the 
density of this region is smaller in Figure 8(d)[ Thus the BMA density estimate 
takes into account the possibility that the model density plotted in Figure 8(c)| is 
appropriate. 

Figure [9] shows the differences between the density estimates produced by BMA 
and the density estimate for the VII model with seven clusters. Again, the difference 
in probability density around the cluster at (0.4,0.1) is evident. There are some other 
small areas of smoothing denoted in blue. The brown peaks are higher densities to 
compensate for the lower density at (0.4,0.1). 
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(a) Model VII with 7 clusters 


(b) Model VEI with 7 clusters 




(c) Model VII with 6 clusters (d) BMA 

Figure 8 Lansing Woods data: (a-c) Contour plots denoting the density estimates 
with the three highest model probabilities and (d) the Bayesian model averaged 
density estimate. The locations of the maple trees are overlaid. 
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Figure 9 Lansing Wood data: Difference between the densities for the model with 
highest probability and after BMA. 
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7 Discussion 


We have implemented Bayesian model averaging for model-based clustering and 
density estimation. The results show that the BMA assesses the model uncertainty 
better than the single best model approach. 

We have proposed similarity matrices as a way of representing ensembles of clus¬ 
tering solutions. They provide an intuitive way of visualising clustering solutions 
and are consistent across models with different numbers of clusters. They are also in¬ 
variant to label switching between models. Similar ideas were put forward by [Strehl 


and Ghosh (2003) and Monti et al. (2003) using binary classification matrices, while 


Fern and Brodley (2003) extended the idea to soft classifications when proposing 


their random projection clustering method. All of these papers use different ways 
of combining the matrices than BMA. 


Kuncheva and Hadjitodorov (2004) suggested that “cutting” the matrix at a 


certain threshold is equivalent to running the single link algorithm and cutting 
the dendrogram at that level. We argue that using complete linkage gives a more 
intuitive interpretation. 

We have shown how to apply Bayesian model averaging to clustering solutions 
using the similarity matrix. This provides a statistical postprocessing method which 
takes account of model uncertainty. When used on datasets with well-documented 
structure or a known ground truth, results achieved with Bayesian model averaging 
are consistent with the ground truth. We have shown that carrying out BMA on 
datasets that have no ground truth might give additional information than using 
model-based clustering alone and at little computational cost. 

The model-based density estimation method described in Fraley and Raftery 


(2002) gives better results in many cases than those given by well-known kernel 
density estimation methods for the simulated datasets analysed here. With the 
addition of the Bayesian model averaging described in this paper, even more im¬ 
provement can be seen, and this becomes more pronounced in higher dimensions. 


In the case where the derivative of the density estimate is of interest (e.g. Jones 


(1994) and others), a separate bandwidth is needed for each order derivative (Chacon 


and Duong, 2013). An advantage of the finite mixture density estimate and the BMA 


mixture density estimate is that a single estimate is produced which can be used for 
estimating derivatives of any order. 

The proposed methodology could also be used when more than one family of 
component distributions is under consideration. For example, the normal mixture 
models with eigendecomposed covariances used in mclust (Fraley and Raftery, 2002), 


the normal mixtures with factor analytic covariance structure used in pgmm (McNi- 
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cholas and Murphy, 2008), the multivariate t mixtures used in EM MIX (McLachlan 


and Peel, 1999) and the skew-t mixtures used in EMMIXuskew (Lee and McLachlan 


2013a) could be combined using the BMA procedure. 


One previous approach to Bayesian model averaging within model-based cluster¬ 


ing (Wei and McNicholas, 2014) considers noninvariance of model-based clustering 
results to cluster labeling by matching the clusterings for competing models using an 
cluster agreement criterion. Further, they combine the clusters in the larger model 
so that the models being combined have the same number of clusters, G. We have 
proposed a very different approach for Bayesian model averaging for model-based 
clustering by choosing a quantity of interest, A, that is invariant to cluster labeling 
and that has a common meaning for all values of G. 

Overall, our results suggest that Bayesian model averaging is a useful postpro¬ 
cessing tool for model-based clustering and density estimation. It often helps and 
seldom disimproves the results, so it could be used routinely as part of model-based 
clustering. 
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A Settings used for density simulations 

A.l Bivariate extensions of the Marron and Wand distribu¬ 
tions as used in Fraley et al. (2002) 

A. 1.1 Single Gaussian 



Figure 10 Gaussian distribution: Contour map denoting the density of a unimodal 
Gaussian distribution. 


Table 7 Gaussian distribution: Simulation settings. 


Cluster 

Prop 

Mean 

Covariance 


T 9 

(da) 

(sy 

1 

1 

M 

/1.25 0.75\ 

w 

y0.75 1.25y 
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A.1.2 


Skewed unimodal 



Figure 11 Skewed unimodal distribution: Contour map denoting the density of a 
skewed unimodal distribution. 


Table 8 Skewed unimodal distribution: Simulation settings. 


Cluster 


Prop 


Mean 

(Pg) 


Covariance 

(E p ) 


1/5 



1/5 


0.3535534 

.0.3535534, 


0.6804138 

0.4082483 


0.4082483 

0.6804138. 


3/5 


0.7660323 

.0.7660323. 


0.5176083 

.0.3105650 


0.3105650 

0.5176083. 
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A.1.3 


Strongly skewed 





Figure 12 Strongly skewed distribution: Contour map denoting the density of a 
strongly skewed distribution. 


Table 9 Strongly skewed distribution: Simulation settings. 


Cluster 


Prop 


Mean 

iftg) 


Covariance 

(Ep) 


1/8 



1/8 


-0.7071068 

-0.7071068, 


0.5555556 

,0.3333333 


0.3333333 

0.5555556, 


1/8 


-1.178511 

-1.178511, 


0.2469136 

,0.1481481 


0.1481481 

0.2469136, 


1/8 


-1.492781 

-1.492781 


0.10973937 
. 0.06584362 


0.06584362 

0.10973937, 


1/8 


-1.702294' 

-1.702294, 


0.04877305 
. 0.02926383 


0.02926383' 
0.04877305, 


1/8 


-1.84197\ 

-1.84197/ 


0.02167691 

.0.01300615 


0.01300615' 

0.02167691 


1/8 


-1.935086' 

-1.935086, 


0.009634183 

,0.005780510 


0.005780510' 

0.009634183, 


1/8 


-1.997164' 

-1.997164, 


0.004281859 

0.002569116 


0.002569116' 

0.004281859, 
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A.1.4 


Kurtotic unimodal 



Figure 13 Kurtotic unimodal distribution: Contour map denoting the density of a 
kurtotic unimodal distribution. 


Table 10 Kurtotic unimodal distribution: Simulation settings. 


Cluster 

Prop 

Mean 

Covariance 


T 9 

(f^g) 

(S 9 ) 

1 

2/3 

( 0\ 

/ 1.25 0.75\ 

Co) 

y0.75 1.25J 

O 

1/3 

M 

/ 0.03952847 0.02371708\ 

A 

w 

V0.02371708 0.03952847 J 
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A.1.5 


Outlier 



Figure 14 Outlier distribution: Contour map denoting the density of a distribution 
with outliers. 


Table 11 Outlier distribution: Simulation settings. 


Cluster 


Prop 


Mean 

(t^g) 


Covariance 

(S 3 ) 


1/10 



9/10 


0.03952847 

0.02371708 


0.02371708 

0.03952847, 
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Figure 15 Bimodal distribution: Contour map denoting the density of a bimodal 
distribution. 


Table 12 Bimodal Data: Simulation settings. 


Cluster 


Prop 


Mean 

(Pg) 


Covariance 

(gg) 


1/2 


-0.5303301' 

-0.5303301 


0.6804138 

-0.4082483 


—0.4082483 

0.6804138, 


1/2 


0.5303301 

.0.5303301 


0.6804138 

-0.4082483 


—0.4082483 

0.6804138, 
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A.1.7 


Separated bimodal 



Figure 16 Separated bimodal distribution: Contour map denoting the density of a 
separated bimodal distribution. 


Table 13 Separated bimodal distribution: Simulation settings. 


Cluster 


Prop 


Mean 

(Pg) 


Covariance 

(S g ) 


1/2 


— 1.06066 
-1.06066, 


0.6804138 

-0.4082483 


—0.4082483 

0.6804138, 


1/2 


1.06066' 

1.06066, 


0.6804138 

-0.4082483 


-0.4082483 

0.6804138, 
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A. 1.8 Asymmetric bimodal 



Figure 17 Asymmetric bimodal distribution: Contour map denoting the density of 
an asymmetric bimodal distribution. 


Table 14 Asymmetric bimodal distribution: Simulation settings. 


Cluster 


Prop 


3/4 


1/4 


Mean 

(Pg) 


0.7071068 

.0.7071068, 


Covariance 

(E p ) 



0.13888889 

-0.08333333 


-0.08333333' 

0.13888889, 
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A.1.9 


Trimodal 



Figure 18 Trimodal distribution: Contour map denoting the density of a trimodal 
distribution. 


Table 15 Trimodal distribution: Simulation settings. 


Cluster 


Prop 


Mean 

(Pg) 


Covariance 

(gg) 


2/5 


-0.8485281 

-0.8485281 


0.5809475 

-0.3485685 


-0.3485685 

0.5809475. 


2/5 


0.8485281 

.0.8485281 


0.5809475 

-0.3485685 


-0.3485685 

0.5809475. 


1/5 


0.15625 

-0.09375 


-0.09375 

0.15625, 
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A. 1.10 Claw (Bart Simpson) 



Figure 19 Claw distribution: Contour map denoting the density of the claw distri¬ 
bution. 


Table 16 Claw distribution: Simulation settings. 


Cluster 


Prop 


T„ 


Mean 

(/hr) 


Covariance 

(gg) 


2/7 


'0.625 

.0.375 


0.375' 

0.625. 


1/7 


-0.7071068' 

-0.7071068, 


0.03952847 

-0.02371708 


-0.02371708' 

0.03952847, 


1/7 


-0.3535534' 

-0.3535534, 


0.03952847 

-0.02371708 


-0.02371708' 

0.03952847, 


1/7 


0.03952847 

-0.02371708 


-0.02371708' 

0.03952847, 


1/7 


'0.3535534' 

.0.3535534, 


0.03952847 

-0.02371708 


-0.02371708' 

0.03952847, 


1/7 


'0.7071068' 

.0.7071068, 


0.03952847 

-0.02371708 


-0.02371708' 

0.03952847, 
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A.2 Higher dimensional distribution settings 

A.2.1 Additional standard normal columns 

For these results we merely added standard normal observations to the simulated bi¬ 
variate clusters. We did not differentiate between the clusters as cluster membership 
is not important in this setting. 

A.2.2 Three- and six-dimensional extensions 

Table 17 3D Bimodal Data: Settings used to simulate the density of a 3D bimodal 
distribution.Note that displacement means the distance from the origin - in this case 
along the line y = x. Separation as described in the results tables is the separation 
of the group centres. 



39 




Table 18 6D Bimodal Data: Settings used to simulate the density of a 6D bimodal 
distribution. Note that displacement means the distance from the origin - along the 
line x'i = x' 2 , as before. Separation as described in the results tables is the separation 
of the group centres. 


Cluster 

Displacement 

Prop 

T 9 

Mean 

(hg) 

Covariance 

(E») 




/ 

-0.5303301 

\ 


/3.0 

1.0 

0 

0 

0 

o\ 






-0.5303301 



1.0 

3.0 

0 

0 

0 

0 






0.0 



0 

0 

1 

0 

0 

0 


1 

0.75 

1/2 


0.0 



0 

0 

0 

1 

0 

0 






0.0 



0 

0 

0 

0 

0.5 

0 





V 

0.0 

/ 


0 

0 

0 

0 

0 

0.25 2 






^0.5303301^ 



/3.0 

1.0 

0 

0 

0 

0\ 






0.5303301 



1.0 

3.0 

0 

0 

0 

0 




1 /o 


0.0 



0 

0 

1 

0 

0 

0 


Z 


1/2 


0.0 



0 

0 

0 

1 

0 

0 






0.0 



0 

0 

0 

0 

0.5 

0 






V 0.0 j 



o 

0 

0 

0 

0 

0.25 2 






/— 1.06066\ 



3.0 

1.0 

0 

0 

0 

0\ 






-1.06066 



1.0 

3.0 

0 

0 

0 

0 






0.0 



0 

0 

1 

0 

0 

0 


l 

1.5 

1/2 


0.0 



0 

0 

0 

1 

0 

0 






0.0 



0 

0 

0 

0 

0.5 

0 






o.o y 



o 

0 

0 

0 

0 

0.25 2 






/1.06066\ 



3.0 

1.0 

0 

0 

0 

0\ 






1.06066 



1.0 

3.0 

0 

0 

0 

0 


o 


1 /o 


0.0 



0 

0 

1 

0 

0 

0 


Z 


1/2 


0.0 



0 

0 

0 

1 

0 

0 






0.0 



0 

0 

0 

0 

0.5 

0 






V o.o y 



V 0 

0 

0 

0 

0 

0.25 ^ 






1.767767 N 



/3.0 

1.0 

0 

0 

0 

0\ 






-1.767767 



1.0 

3.0 

0 

0 

0 

0 






0.0 



0 

0 

1 

0 

0 

0 


i 

2.5 

1/2 


0.0 



0 

0 

0 

1 

0 

0 






0.0 



0 

0 

0 

0 

0.5 

0 






, 0.0 , 



V o 

0 

0 

0 

0 

0.25 ^ 






1.767767\ 



/3.0 

1.0 

0 

0 

0 

°\ 






1.767767 



1.0 

3.0 

0 

0 

0 

0 


o 


1 /o 


0.0 



0 

0 

1 

0 

0 

0 


Z 


1/2 


0.0 



0 

0 

0 

1 

0 

0 






0.0 



0 

0 

0 

0 

0.5 

0 






0.0 y 



V 0 

0 

0 

0 

0 

0.25 ^ 
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